9 research outputs found

    Parallel algorithms for lattice QCD

    SIGLE record: available from the British Library Document Supply Centre (BLDSC), DSC:D81971, United Kingdom

    An In-Depth Analysis of the Slingshot Interconnect

    The interconnect is one of the most critical components in large-scale computing systems, and its impact on application performance will only grow with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow exascale and hyperscale datacenter networks to be built with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, as well as highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which keeps it interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot delivers these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion than on previous-generation networks. Comment: To be published in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '20), 2020.
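
    The congestion claim can be made concrete with a small victim/aggressor experiment. The sketch below is a minimal MPI microbenchmark of that kind, not the benchmark suite used in the paper: two "victim" ranks measure small-message latency while the remaining ranks load the fabric with large all-to-alls. All names, message sizes, and iteration counts are illustrative.

```c
/*
 * Hedged sketch of a victim/aggressor congestion microbenchmark (illustrative
 * only; not the benchmark suite used in the paper).  Run with at least 4 ranks:
 * ranks 0 and 1 ping-pong small messages, all other ranks run large all-to-alls
 * on a separate communicator to load the fabric.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERS 1000
#define MSG   8          /* victim message size in bytes (illustrative) */
#define BULK  (1 << 20)  /* aggressor per-peer message size in bytes (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int victim = (rank < 2);                 /* ranks 0 and 1 measure latency */
    MPI_Comm sub;                            /* aggressors get their own comm */
    MPI_Comm_split(MPI_COMM_WORLD, victim, rank, &sub);

    MPI_Barrier(MPI_COMM_WORLD);             /* start both groups together */

    if (victim) {
        char buf[MSG] = {0};
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("victim one-way latency under load: %.2f us\n",
                   (MPI_Wtime() - t0) / (2.0 * ITERS) * 1e6);
    } else {
        int n;
        MPI_Comm_size(sub, &n);
        char *snd = malloc((size_t)n * BULK);
        char *rcv = malloc((size_t)n * BULK);
        memset(snd, 1, (size_t)n * BULK);
        for (int i = 0; i < 20; i++)          /* sustained background traffic */
            MPI_Alltoall(snd, BULK, MPI_CHAR, rcv, BULK, MPI_CHAR, sub);
        free(snd);
        free(rcv);
    }

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```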

    Optimised Collectives on QsNet II

    In this paper we present an in-depth description of how QsNet II supports collectives. Performance data from jobs run on 256-1024 node clusters show that the time to complete barrier synchronization is as low as 5 microseconds, with very good scalability. Results for broadcast indicate that QsNet II can deliver data to 512 nodes in 8-10 microseconds and can sustain an asymptotic bandwidth in excess of 800 Mbytes/sec to all nodes. A 512-node reduction (64-bit floating-point sum) completes in 18 microseconds. A gather of a single word from 1024 nodes completes in 30 microseconds. The all-to-all collective delivers 350 Mbytes/sec/node (85% of peak) across 512-1024 nodes.
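
    For readers who want to reproduce this style of measurement on their own system, the sketch below times a barrier and an 8-byte floating-point sum reduction at the MPI level. It is only an approximation of the paper's methodology, which measured the QsNet II hardware collectives directly; the iteration count is illustrative.

```c
/*
 * Hedged sketch: MPI-level timing of a barrier and an 8-byte floating-point
 * sum reduction, loosely mirroring the measurements quoted above.  The paper
 * measured the QsNet II hardware collectives directly; this is only an
 * approximation, and the iteration count is illustrative.
 */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Barrier latency: average over many back-to-back barriers. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    /* 64-bit floating-point sum across all ranks (an allreduce as a
     * stand-in for the reduction measured in the paper). */
    double in = rank, out;
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double reduce_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    if (rank == 0)
        printf("barrier: %.1f us, 8-byte sum allreduce: %.1f us\n",
               barrier_us, reduce_us);

    MPI_Finalize();
    return 0;
}
```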

    Shared Memory Programming on the Meiko CS-2

    An interesting feature of some recent parallel computers is that the transport mechanism underlying the currently dominant message-passing interfaces is based on a global address space model. By accessing this global address space directly, most of the inherent delays of administering message buffers and queues can be avoided. Using this interface, we have implemented a user-level distributed shared memory layer based on the virtual memory protection mechanisms of the operating system. The synchronisation required to maintain the coherency of the memory is addressed by implementing a distributed shared lock which exploits the remote atomic store operations provided by the Meiko CS-2. This allows an asynchronous style of programming where the load is dynamically distributed over the nodes of a parallel partition.
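
    The page-protection technique the abstract refers to is the classic mprotect/SIGSEGV mechanism. The sketch below shows only that generic technique; it is not the Meiko CS-2 implementation, and fetch_remote_page() is a hypothetical stand-in for the transfer the original layer performs over the CS-2 global address space.

```c
/*
 * Hedged sketch of the generic mprotect/SIGSEGV technique behind a user-level
 * DSM layer.  This is not the Meiko CS-2 implementation: fetch_remote_page()
 * is a hypothetical stand-in for the remote transfer, and a production DSM
 * layer must be far more careful (signal safety, write detection, coherence
 * protocol) than this single-page-in illustration.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DSM_PAGES 16

static char  *dsm_base;
static size_t page_size;

/* Hypothetical: pull the current copy of the page from its home node. */
static void fetch_remote_page(void *page) {
    memset(page, 0, page_size);               /* stand-in for the real transfer */
}

static void dsm_fault_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < dsm_base || addr >= dsm_base + DSM_PAGES * page_size)
        abort();                               /* a genuine segfault, not ours */
    char *page = dsm_base + ((addr - dsm_base) / page_size) * page_size;
    mprotect(page, page_size, PROT_READ | PROT_WRITE);  /* make the page accessible */
    fetch_remote_page(page);                   /* then the faulting access retries */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    dsm_base = mmap(NULL, DSM_PAGES * page_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dsm_base == MAP_FAILED)
        return 1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = dsm_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    dsm_base[123] = 42;   /* faults, handler maps the page in, the store retries */
    printf("dsm_base[123] = %d\n", dsm_base[123]);
    return 0;
}
```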

    Not all applications have boring communication patterns: Profiling message matching with BMM

    Message matching within MPI is an important performance consideration for applications that use two-sided semantics. In this work, we present an instrumentation of the Cray MPI library that allows the collection of detailed message-matching statistics, as well as a software implementation of hashed matching. We use this functionality to profile key DOE applications with complex communication patterns to determine under what circumstances an application might benefit from hardware offload capabilities within the NIC to accelerate message matching. We find that several applications and libraries exhibit match lists long enough to motivate a Binned Message Matching approach.
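
    To illustrate how long match lists arise in the first place, the sketch below pre-posts many receives with distinct tags on one rank and then delivers messages in worst-case order, so each arriving message must be matched against a long posted-receive list. It is a generic MPI illustration, not the paper's BMM instrumentation; the request count is arbitrary.

```c
/*
 * Hedged sketch of how long posted-receive match lists arise (generic MPI
 * illustration, not the paper's BMM instrumentation; NREQ is arbitrary).
 * Rank 1 pre-posts one receive per tag; rank 0 sends in reverse tag order, so
 * early arrivals must be matched against (almost) the whole posted list, which
 * a classic MPI implementation traverses linearly.
 */
#include <mpi.h>
#include <stdio.h>

#define NREQ 4096

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static int buf[NREQ];
    static MPI_Request req[NREQ];

    if (rank == 1) {
        /* Build a long match list: one pending receive per tag. */
        for (int tag = 0; tag < NREQ; tag++)
            MPI_Irecv(&buf[tag], 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[tag]);
        MPI_Waitall(NREQ, req, MPI_STATUSES_IGNORE);
        printf("matched %d pre-posted receives\n", NREQ);
    } else if (rank == 0) {
        /* Worst-case order: each message's tag sits near the end of the list. */
        for (int tag = NREQ - 1; tag >= 0; tag--) {
            buf[tag] = tag;
            MPI_Send(&buf[tag], 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```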

    General in-network processing – time is ripe!

    Remote Direct Memory Access (RDMA) networks have been around for more than a decade. RDMA hardware enables basic put/get operations into userland at very high speeds and reduces CPU overheads significantly. However, we observe that the CPU requirements for processing data at modern speeds of 400 or 800 Gbit/s are still huge. Modern smart NICs add various processing capabilities, ranging from fully-fledged ARM cores to FPGA-accelerated NICs. However, all current implementations are either relatively inefficient for line-rate packet processing or offer only limited functions such as header rewriting. We advocate a fully flexible model that allows arbitrary C code to be executed on each packet. We show that 'streaming Processing in the Network' (sPIN) enables such a model. Our implementation, based on RISC-V, demonstrates that generic network acceleration is feasible and delivers an efficiency improvement of up to 100x. We release our implementations as open source and expect that more vendors will adopt generic in-network computations in addition to RDMA.
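
    To give a feel for the "arbitrary C code on each packet" model, the sketch below is a host-side simulation of a per-packet handler that counts packets and accumulates a checksum over the payload. The types and entry point (pkt_t, msg_state_t, handle_packet) are hypothetical illustrations, not the actual sPIN handler API.

```c
/*
 * Hedged sketch: a host-side simulation of a per-packet handler in a
 * sPIN-like model.  pkt_t, msg_state_t and handle_packet() are hypothetical
 * illustrations, not the actual sPIN API; in sPIN such handlers run on
 * NIC-attached cores for every packet of a matched message.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    const uint8_t *payload;   /* packet payload as delivered by the NIC */
    size_t         length;    /* payload length in bytes */
} pkt_t;

typedef struct {
    uint64_t packets;         /* per-message state shared by handler invocations */
    uint64_t checksum;
} msg_state_t;

/* Invoked once per arriving packet; returning 0 lets the payload be
 * deposited into host memory as usual. */
static int handle_packet(const pkt_t *pkt, msg_state_t *state) {
    uint64_t sum = 0;
    for (size_t i = 0; i < pkt->length; i++)
        sum += pkt->payload[i];
    /* On a real NIC these updates would need the device's atomics, since
     * several handler instances may run concurrently on different packets. */
    state->packets  += 1;
    state->checksum += sum;
    return 0;
}

int main(void) {
    /* Simulate one packet arriving. */
    uint8_t data[] = {1, 2, 3, 4};
    pkt_t pkt = { data, sizeof data };
    msg_state_t st = {0, 0};
    handle_packet(&pkt, &st);
    printf("packets=%llu checksum=%llu\n",
           (unsigned long long)st.packets, (unsigned long long)st.checksum);
    return 0;
}
```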